
cuda.bindings latency benchmarks - part 2 #1856

Open

danielfrg wants to merge 6 commits into main from cuda-bindings-bench-more

Conversation

@danielfrg
Contributor

@danielfrg danielfrg commented Apr 3, 2026

Description

closes #1580

Follow-up to #1580.

Adding a couple more benchmarks here and fixing a couple of issues with the pyperf JSON handling.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Contributor

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@danielfrg danielfrg force-pushed the cuda-bindings-bench-more branch from 0cfea1d to a3f0678 on April 3, 2026 15:25
@danielfrg
Contributor Author

Here are the results for a run on my dev machine (4090):


----------------------------------------------------------------------------------
Benchmark                                   C++ (mean)   Python (mean)    Overhead
----------------------------------------------------------------------------------
ctx_device.ctx_get_current                        6 ns          112 ns     +106 ns
ctx_device.ctx_get_device                         8 ns          122 ns     +113 ns
ctx_device.ctx_set_current                        8 ns          103 ns      +96 ns
ctx_device.device_get                             6 ns          126 ns     +120 ns
ctx_device.device_get_attribute                   9 ns          195 ns     +186 ns
event.event_create_destroy                       90 ns          307 ns     +218 ns
event.event_query                                74 ns          215 ns     +140 ns
event.event_record                               93 ns          229 ns     +136 ns
event.event_synchronize                          94 ns          239 ns     +145 ns
launch.launch_16_args                          1.57 us         3.12 us    +1545 ns
launch.launch_16_args_pre_packed               1.58 us         1.99 us     +409 ns
launch.launch_empty_kernel                     1.54 us         1.85 us     +302 ns
launch.launch_small_kernel                     1.54 us         2.23 us     +690 ns
pointer_attributes.pointer_get_attribute         29 ns          511 ns     +482 ns
stream.stream_create_destroy                   3.78 us         4.06 us     +274 ns
stream.stream_query                              86 ns          232 ns     +145 ns
stream.stream_synchronize                       111 ns          257 ns     +146 ns
----------------------------------------------------------------------------------
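The overhead column above is just the difference between the Python and C++ means. A minimal sketch of that computation (hypothetical helper name; the runner's actual JSON handling may differ), assuming the means are available in seconds:

```python
def format_overhead(name, cpp_mean_s, py_mean_s):
    """Render one table row: means are given in seconds, reported in ns."""
    overhead_ns = round((py_mean_s - cpp_mean_s) * 1e9)
    return (
        f"{name:<40} {cpp_mean_s * 1e9:>8.0f} ns "
        f"{py_mean_s * 1e9:>8.0f} ns {overhead_ns:>+6d} ns"
    )

# Example using the ctx_get_current row above: 6 ns (C++) vs 112 ns (Python).
row = format_overhead("ctx_device.ctx_get_current", 6e-9, 112e-9)
# → "... +106 ns"
```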

@danielfrg danielfrg requested review from mdboom and rwgk April 3, 2026 16:52
@danielfrg danielfrg self-assigned this Apr 3, 2026
@rwgk
Collaborator

rwgk commented Apr 6, 2026

This is the first time I'm looking at this code. My second question was: what is the purpose of these benchmarks? Cursor (GPT-5.4 Extra High Fast) offered some answers. I asked it to generate a "Motivation" section based on what it found; see below. I think it would be a great addition to cuda_bindings/benchmarks/README.md.


Motivation

These benchmarks are intended to measure the latency overhead of calling CUDA Driver APIs through cuda.bindings, relative to a similar C++ baseline.

The main goal is to help answer questions such as:

  • How much overhead does the Python binding layer add to very small CUDA API calls?
  • Are we staying within our target of keeping Python overhead below roughly 1 us for representative operations?
  • Do changes to argument conversion, result handling, or wrapper internals introduce measurable regressions?

The paired C++ benchmarks are included to provide a lower-level reference point for the same operation. Comparing Python and C++ results helps estimate the additional cost introduced by the Python-to-C boundary and by binding-specific marshalling work.

These benchmarks are not intended to measure overall GPU performance, kernel throughput, or end-to-end application speed. Most of the benchmarked operations are deliberately tiny, so the reported numbers are best interpreted as binding/API-call latency measurements and regression signals, rather than as predictions of full application performance.

Because the benchmarked operations are so small, methodology matters a lot. The most useful comparisons are between Python and C++ benchmarks that perform work that is as close to identical as possible and are run under similar conditions.
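The "roughly 1 us" budget mentioned above can be turned into a simple regression gate over the per-benchmark means. A sketch (the threshold and helper name are assumptions, not part of the PR):

```python
BUDGET_NS = 1000  # assumed target: Python overhead below roughly 1 us

def overhead_regressions(results, budget_ns=BUDGET_NS):
    """Return names of benchmarks whose Python-minus-C++ overhead exceeds the budget.

    `results` maps benchmark name -> (cpp_mean_ns, python_mean_ns).
    """
    return sorted(
        name
        for name, (cpp_ns, py_ns) in results.items()
        if py_ns - cpp_ns > budget_ns
    )

# Two rows from the results table: one within budget, one over it.
flagged = overhead_regressions({
    "ctx_device.ctx_get_current": (6, 112),    # +106 ns, within budget
    "launch.launch_16_args": (1570, 3120),     # +1550 ns, over budget
})
```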

@rwgk
Collaborator

rwgk commented Apr 6, 2026

My first question (to Cursor) when reviewing this PR was:

I'm not very familiar with benchmarking. I glanced through the PR; it seems to set up pairs of equivalent C++ / Python tests, run both, and then compute the overhead resulting from running via Python. Could you please look carefully at whether the C++ / Python pairs are actually equivalent? Is there anything that looks like it would lead to distorted/systematically biased results?

After it gave me the response below I started thinking about the motivation, with the result in the previous comment. In light of that, the findings below still seem relevant, but I'd need to look closer to be more certain which of the "not clean apples-to-apples" aspects it found are actually meaningful. I hope they are at least a good starting point for figuring it out together, so I'm copy-pasting them below.


Findings

  • High: event.event_query and event.event_synchronize are not actually paired with the Python versions. The C++ binary records onto the shared event first at cuda_bindings/benchmarks/benchmarks/cpp/bench_event.cpp:60, then benchmarks query/synchronize on that same handle at cuda_bindings/benchmarks/benchmarks/cpp/bench_event.cpp:67 and cuda_bindings/benchmarks/benchmarks/cpp/bench_event.cpp:75 without restoring the original completed-event state. The Python side sets up a completed event at cuda_bindings/benchmarks/benchmarks/bench_event.py:16 and cuda_bindings/benchmarks/benchmarks/bench_event.py:17, and each benchmark runs in its own pyperf worker from cuda_bindings/benchmarks/runner/main.py:132, so event_query and event_synchronize start clean there. That makes those rows non-equivalent.

  • High: The launch.* benchmarks are not clean apples-to-apples overhead measurements. Both sides enqueue asynchronously with no sync/reset inside the timed loop at cuda_bindings/benchmarks/benchmarks/bench_launch.py:64, cuda_bindings/benchmarks/benchmarks/bench_launch.py:77, cuda_bindings/benchmarks/benchmarks/bench_launch.py:90, cuda_bindings/benchmarks/benchmarks/bench_launch.py:102, cuda_bindings/benchmarks/benchmarks/cpp/bench_launch.cpp:136, cuda_bindings/benchmarks/benchmarks/cpp/bench_launch.cpp:147, cuda_bindings/benchmarks/benchmarks/cpp/bench_launch.cpp:157, and cuda_bindings/benchmarks/benchmarks/cpp/bench_launch.cpp:169. Python is driven through pyperf at cuda_bindings/benchmarks/runner/main.py:132, which calibrates loop counts, while C++ uses fixed loops = 1000 at cuda_bindings/benchmarks/benchmarks/cpp/bench_support.hpp:21. So the two sides are almost certainly measuring different queue-depth and backpressure regimes. On top of that, the later C++ launch rows reuse the same stream sequentially from cuda_bindings/benchmarks/benchmarks/cpp/bench_launch.cpp:112, so they inherit prior queued work in a way the Python rows do not.

  • Medium: Even the synchronous rows are not measured under the same harness. Python uses pyperf.Runner().bench_time_func(...) at cuda_bindings/benchmarks/runner/main.py:132, while C++ uses one in-process custom suite with fixed warmups, values, and runs at cuda_bindings/benchmarks/benchmarks/cpp/bench_support.hpp:21, cuda_bindings/benchmarks/benchmarks/cpp/bench_support.hpp:22, cuda_bindings/benchmarks/benchmarks/cpp/bench_support.hpp:23, cuda_bindings/benchmarks/benchmarks/cpp/bench_support.hpp:24, and cuda_bindings/benchmarks/benchmarks/cpp/bench_support.hpp:103. For 6-100 ns calls, that is enough to change timer overhead, loop overhead, and process-isolation behavior, so the delta is not a pure wrapper-overhead number.

  • Medium: The C++ baseline includes extra hot-loop work that Python does not. Most C++ rows wrap each CUDA call in check_cu(...) at cuda_bindings/benchmarks/benchmarks/cpp/bench_ctx_device.cpp:42, cuda_bindings/benchmarks/benchmarks/cpp/bench_event.cpp:53, cuda_bindings/benchmarks/benchmarks/cpp/bench_stream.cpp:46, cuda_bindings/benchmarks/benchmarks/cpp/bench_launch.cpp:135, and cuda_bindings/benchmarks/benchmarks/cpp/bench_pointer_attributes.cpp:45. The Python loops just discard the returned status at cuda_bindings/benchmarks/benchmarks/bench_ctx_device.py:22, cuda_bindings/benchmarks/benchmarks/bench_event.py:29, cuda_bindings/benchmarks/benchmarks/bench_stream.py:23, cuda_bindings/benchmarks/benchmarks/bench_launch.py:64, and cuda_bindings/benchmarks/benchmarks/bench_pointer_attributes.py:24. That makes the C++ side a little slower and systematically understates Python overhead, especially on the fastest context/device rows.

  • Medium: launch.launch_16_args is intentionally not equivalent to the C++ row. Python passes (_args, _arg_types) at cuda_bindings/benchmarks/benchmarks/bench_launch.py:90, so the binding has to marshal Python arguments every call. C++ launch_16_args and launch_16_args_pre_packed are the same pre-packed void*[] launch at cuda_bindings/benchmarks/benchmarks/cpp/bench_launch.cpp:155 and cuda_bindings/benchmarks/benchmarks/cpp/bench_launch.cpp:167. The launch_16_args_pre_packed row is the closer wrapper-only comparison; the plain launch_16_args row measures Python marshalling overhead too.
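One way to address the check_cu(...) asymmetry without touching the C++ side is to add equivalent status handling inside the Python timed loop. A sketch with a stand-in status-returning call (no CUDA handles or real bindings are used here; cuda.bindings driver calls return the status as the first element of a tuple):

```python
CUDA_SUCCESS = 0

def fake_cu_call():
    """Stand-in for a cuda.bindings driver call returning (status, ...)."""
    return (CUDA_SUCCESS,)

def timed_loop_with_check(loops):
    """Mirror the C++ check_cu(...) pattern: inspect the status on every call,
    instead of discarding the returned tuple as the current Python loops do."""
    checked = 0
    for _ in range(loops):
        status, = fake_cu_call()
        if status != CUDA_SUCCESS:  # the same per-call branch the C++ hot loop pays for
            raise RuntimeError(f"CUDA error {status}")
        checked += 1
    return checked
```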

What Looks Reasonably Matched

  • ctx_device.*, pointer_attributes.pointer_get_attribute, stream.stream_*, and event.event_create_destroy look structurally close in API choice and setup.

  • Their exact nanosecond deltas should still be treated cautiously because of the harness mismatch and the extra check_cu(...) cost on the C++ side.

  • event.event_record is closer than the other async rows, but it still suffers from the async queue-depth mismatch between pyperf-calibrated Python runs and fixed-loop C++ runs.

Bottom Line

  • Some pairs are fine directionally, but the PR is not consistently comparing truly equivalent C++ and Python work.

  • event.event_query, event.event_synchronize, and the launch.* rows should not be treated as clean Python-overhead-over-equivalent-C++ measurements.

  • If the intent is a defensible binding-overhead study, the biggest fixes are: use the same harness model on both sides, reset and synchronize async state between values and between named benchmarks, and either remove check_cu(...) from the timed C++ loop or add equivalent explicit status handling on the Python side.
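The "reset and synchronize async state" fix can be sketched harness-independently: drain pending work before the timer starts, and include a final drain inside the timed region so every value measures the same queue depth regardless of how many loops the harness calibrates. This uses a plain deque as a stand-in for a CUDA stream; the function names are illustrative only:

```python
import time
from collections import deque

queue = deque()  # stand-in for a CUDA stream's pending work

def enqueue_launch():
    queue.append("kernel")  # stand-in for an async cuLaunchKernel

def synchronize():
    queue.clear()  # stand-in for cuStreamSynchronize

def bench_launch(loops):
    synchronize()                  # start every timed value from an empty queue
    t0 = time.perf_counter()
    for _ in range(loops):
        enqueue_launch()
    synchronize()                  # charge the timed region for draining its own work
    return time.perf_counter() - t0

elapsed = bench_launch(1000)
```

With this shape, a pyperf-calibrated Python run and a fixed-loop C++ run both amortize one drain per value over their own loop count, rather than inheriting whatever backlog the previous benchmark left queued.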

Note

  • This review was based on code inspection plus a local check of the pyperf worker and loop behavior used by cuda_bindings/benchmarks/runner/main.py.



Successfully merging this pull request may close these issues.

Python latency testing & benchmarking
